# calculate entropy of data | True model
(ps <- dnorm(y, mean=mu, sd=sigma))
[1] 0.1972397 0.0219918 0.1209854 0.1841351
-sum(ps*log(ps))
[1] 0.9712345
Let’s fit two simple models
# fit model with just a mean
m0 <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu ~ dnorm(5, 3),
    sigma ~ dexp(1)
  ), data=data.frame(x,y))
precis(m0)
mean sd 5.5% 94.5%
mu 10.061225 1.7454374 7.271679 12.850771
sigma 3.761666 0.9382176 2.262213 5.261119
# fit model with a mean and slope
m1 <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu ~ a + b*x,
    a ~ dnorm(5, 3),
    b ~ dnorm(0, 1),
    sigma ~ dexp(1)
  ), data=data.frame(x,y))
precis(m1)
mean sd 5.5% 94.5%
a 4.681483 1.4027937 2.4395479 6.923418
b 1.285315 0.2185142 0.9360866 1.634542
sigma 1.576934 0.4491628 0.8590849 2.294783
Cross entropy from using m0 to approximate Truth
\[
H(p, q) = -\sum_{i=1}^n p_i \log(q_i)
\]
## cross entropy
(qs <- dnorm(y, mean=preds_m0[1:4],  # probs if we use m0
             sd=mean(extract.samples(m0)$sigma)))
[1] 0.02639825 0.06654662 0.05294836 0.02802186
-sum(ps*log(qs))
[1] 1.790202
# added entropy by using m0 to approximate True
-sum(ps*log(qs)) - -sum(ps*log(ps))
[1] 0.8189678
Cross entropy from using m1 to approximate Truth
\[
H(p, q) = -\sum_{i=1}^n p_i \log(q_i)
\]
## cross entropy
(rs <- dnorm(y, mean=preds_m1[1:4],  # probs if we use m1
             sd=mean(extract.samples(m1)$sigma)))
[1] 0.09555285 0.06663565 0.22113604 0.17771967
-sum(ps*log(rs))
[1] 1.023365
# added entropy by using m1 to approximate True
-sum(ps*log(rs)) - -sum(ps*log(ps))
[1] 0.05213056
Kullback-Leibler divergence
\[
D_{KL}(p,q) = \sum_{i=1}^n p_i\left[ \log(p_i) - \log(q_i) \right]
\] measures the added entropy from using a model to approximate True
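A tiny self-contained sketch of this definition on made-up discrete distributions (two hypothetical coins, not the data above) also shows a property worth remembering: \(D_{KL}\) is nonnegative and asymmetric.

```r
# Tiny discrete example (hypothetical coins, not the data above)
p <- c(0.7, 0.3)   # "true" distribution
q <- c(0.5, 0.5)   # approximating distribution

dkl_pq <- sum(p * (log(p) - log(q)))
dkl_qp <- sum(q * (log(q) - log(p)))
c(dkl_pq, dkl_qp)  # both nonnegative, and not equal: D_KL is asymmetric
```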
# added entropy by using m0 to approximate True
-sum(ps*log(qs)) - -sum(ps*log(ps))
[1] 0.8189678
## Dkl(p,q)
sum(ps*(log(ps)-log(qs)))
[1] 0.8189678
## --> it's the same!
## added entropy by using m1 to approximate true
-sum(ps*log(rs)) - -sum(ps*log(ps))
# Difference in KL distances between m0 and m1
sum(ps*(log(ps)-log(qs))) - sum(ps*(log(ps)-log(rs)))
[1] 0.7668372
# We can get the same result if
# we ignore the first log(ps) term in both quantities
-sum(ps*log(qs)) - -sum(ps*log(rs))
[1] 0.7668372
What if we do not know the Truth?
The \(p_i\) (ps) have almost dropped out of the quantity, but not quite.
If we take them out completely, we end up with log-probability scores, which are just unstandardized versions of the same comparison.
\[
\begin{align}
D_{KL}(p,q) - D_{KL}(p,r) & = \sum_{i=1}^n p_i\left[ - \log(q_i) \right] - \sum_{i=1}^n p_i\left[ - \log(r_i) \right] \\
& \propto \sum_{i=1}^n -\log(q_i) - \sum_{i=1}^n -\log(r_i)
\end{align}
\] So we use log probabilities to describe fit and compare them between models <phew!>
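A minimal sketch of the log-probability score in action (toy numbers, not the data or models above): sum the log densities of the observations under each candidate model, and the model closer to the generating process gets the higher score.

```r
# Toy data (hypothetical, not the y used above)
set.seed(1)
y_toy <- rnorm(4, mean=10, sd=2)

# log-probability score sum(log q_i) under two candidate models
score_q <- sum(dnorm(y_toy, mean=10, sd=2, log=TRUE))  # close to the generator
score_r <- sum(dnorm(y_toy, mean=0,  sd=2, log=TRUE))  # far from the generator

score_q - score_r  # positive: the better model has the higher score
```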
But not quite: lppd
We have been pretending we have a single value for our expectations (the MAP). In actuality, we have a full distribution (the posterior). --> log pointwise predictive density (lppd)
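A minimal sketch of the lppd computation, using a hypothetical toy posterior rather than the m0/m1 fits above: for each observation, average its density over the posterior draws, take the log, then sum across observations.

```r
# Toy data and a hypothetical posterior (not m0/m1 from above)
set.seed(2)
y_toy <- rnorm(4, mean=10, sd=2)

# pretend these are S draws from the posterior of (mu, sigma)
S <- 1000
mu_post    <- rnorm(S, mean=10, sd=0.5)
sigma_post <- abs(rnorm(S, mean=2, sd=0.2))

# lppd: log of the posterior-averaged density, summed over observations
lppd <- sum(sapply(y_toy, function(yi) {
  log(mean(dnorm(yi, mean=mu_post, sd=sigma_post)))
}))
lppd
```

Note the order of operations: average the densities first, then take the log; averaging the log densities instead would give a different (and wrong) quantity.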